Abstract

Suppose a string $X_1^n = (X_1, X_2, \ldots, X_n)$ generated by a memoryless source $(X_n)_{n \ge 1}$ with distribution $P$ is
to be compressed with distortion no greater than $D \ge 0$, using a memoryless random codebook with distribution $Q$.
The compression performance is determined by the "generalized asymptotic equipartition property" (AEP), which
states that the probability of finding a $D$-close match between $X_1^n$ and any given codeword $Y_1^n$ is approximately
$2^{-nR(P,Q,D)}$, where the rate function $R(P,Q,D)$ can be expressed as an infimum of relative entropies. The main
purpose here is to remove various restrictive assumptions on the validity of this result that have appeared in the
recent literature. Necessary and sufficient conditions for the generalized AEP are provided in the general setting
of abstract alphabets and unbounded distortion measures. All possible distortion levels $D \ge 0$ are considered; the
source $(X_n)_{n \ge 1}$ can be stationary and ergodic; and the codebook distribution can have memory. Moreover, the
behavior of the matching probability is precisely characterized, even when the generalized AEP is not valid. Natural
characterizations of the rate function $R(P,Q,D)$ are established under equally general conditions.

Index Terms 

Rate-distortion theory, data compression, large deviations, asymptotic equipartition property, random codebooks, 
pattern-matching 

I. Introduction 

Suppose a random string $X_1^n = (X_1, X_2, \ldots, X_n)$ produced by a memoryless source $(X_n)_{n \ge 1}$ with
distribution $P$ on a source alphabet $S$ is to be compressed with distortion no more than some $D \ge 0$
with respect to a single-letter distortion measure $\rho(x,y)$.¹ The basic information-theoretic model for
understanding the best performance that can be achieved is the study of random codebooks. If we generate
memoryless random strings $Y_1^n = (Y_1, Y_2, \ldots, Y_n)$ according to some distribution $Q$ on the reproduction
alphabet $T$, we would like to know how many such strings are needed so that, with high probability, we
will be able to find at least one codeword that matches the source string $X_1^n$ with distortion $D$ or less.
The crucial mathematical problem in answering this question is the evaluation of the probability that a
given, typical $X_1^n$ will be $D$-close to a random $Y_1^n$. This probability can be expressed as

$$\mathrm{Prob}\{Y_1^n \in B_n(X_1^n, D) \mid X_1^n\} = Q^n(B_n(X_1^n, D)) \tag{1}$$

where $B_n(X_1^n, D)$ denotes the "distortion ball" consisting of all reproduction strings that are within
distortion $D$ (or less) from $X_1^n$; note that the matching probability in (1) is itself a random quantity, as it
depends on the source string $X_1^n$.

The importance of evaluating (1) was already identified by Shannon in his classic study of rate-distortion
theory [15], where he showed that, for the best codebook distribution $Q = Q^*$, we have

$$(Q^*)^n(B_n(X_1^n, D)) \approx 2^{-nR(P,D)} \tag{2}$$

where $R(P,D)$ is the rate-distortion function of the source.

This work was supported in part by a National Defense Science and Engineering Graduate Fellowship. The material in this paper is 
preceded by a technical report [8]. Preliminary results were presented at [9]. 
¹Precise rigorous definitions are given in the following section.

The more general question of evaluating the matching probability (1) for distributions $Q$ perhaps
different from the optimal reproduction distribution $Q^*$ arises naturally in a variety of contexts,
including problems in pattern-matching, mismatched codebooks, Lempel-Ziv compression, combinatorial
optimization on random strings, and others; see, e.g., [20], [13], [18], [12], [19], [4], [17], [2], [16], and the
review and references in [5]. In this case, Shannon's estimate is replaced by the so-called "generalized
asymptotic equipartition property" (or generalized AEP), which states that

$$-\frac{1}{n}\log Q^n(B_n(X_1^n, D)) \to R(P,Q,D) \quad \text{a.s.} \tag{3}$$

where "a.s." stands for "almost surely" and refers to the random string $X_1^n$. The rate function $R(P,Q,D)$
is defined in a way that closely resembles the rate-distortion function definition,

$$R(P,Q,D) := \inf_{W} H(W\|P \times Q)$$

where $H(\cdot\|\cdot)$ denotes the relative entropy, and the infimum is over all (bivariate) probability distributions $W$
of random variables $(U,V)$ with values in $S$ and $T$, respectively, such that $U$ has distribution $P$ and the
expected distortion $E[\rho(U,V)] \le D$. (For a broad introduction to the generalized AEP, its applications
and refinements, see [5] and the references therein.)

The study of the rate function $R(P,Q,D)$ and its properties is an important step in understanding the
generalized AEP. In terms of lossy data compression, it is not hard to see that $R(P,Q,D)$ is equal to the
compression rate achieved by a (typically mismatched) random codebook with distribution $Q$. In view of
this, it is not surprising that the rate-distortion function turns out to be equal to $R(P,Q^*,D)$ when the
codebook distribution is chosen optimally,

$$R(P,D) = \inf_{Q} R(P,Q,D)$$

with the infimum being over all probability distributions $Q$ on the reproduction alphabet $T$. Another
important and useful observation made by various authors in the recent literature is that $R(P,Q,D)$ can
alternatively be expressed as a convex dual.

Although much is known about the generalized AEP and about $R(P,Q,D)$ [5], all known results are
established under certain restrictive conditions. In most cases the codebook distribution is required to be
memoryless, and when it is not, it is assumed that the distortion measure is bounded. Moreover, only
distortion levels in a certain range are considered, and the case when

$$D = D_{\min}(P,Q) := \inf\{D : R(P,Q,D) < \infty\}$$

is always excluded.

The main point of this paper is to remove these constraints, and to analyze which (if any) are essential
for the validity of the generalized AEP. Our motivation is twofold. On one hand, unnecessarily stringent
conditions make the theoretical picture incomplete. On the other, there are applications which naturally
require more general statements. For example, in the study of universal lossy compression, where the
source distribution is not known a priori, how can we assume that the distortion value chosen will be in
the appropriate range and will not coincide with $D_{\min}$? (Specific applications of the results in this paper
to central problems in universal lossy data compression will be developed in subsequent work.) Similarly,
the usual constraints on the distortion measure may fail to hold even for some basic distortion measures,
like squared error distortion in the case of continuous alphabets. And the lack of information about the
generalized AEP at $D = D_{\min}$ makes it difficult to draw tight correspondences between lossy and lossless
compression, cf. [5].

Thus motivated, we give necessary and sufficient conditions for the generalized AEP in (3), and we
precisely characterize the behavior of the matching probability in the pathological situations when the
generalized AEP fails. Our results hold for all values of $D$, and they cover arbitrary abstract alphabets and
distortion measures. We also allow the source to be stationary and ergodic, and the codebook distribution
to have memory. We similarly extend the characterization of the rate function $R(P,Q,D)$ to the same
level of generality. We show that it can always be written as a convex dual, and that a minimizer $W$ in
the definition of $R(P,Q,D)$ always exists (unless, of course, the infimum is taken over the empty set).

Sections II and III contain the main results. Section IV contains generalizations to the case when the
codebook distribution has memory. The bulk of the paper is devoted to proofs, which are collected in
Section V. Our main mathematical tool is a generalized, one-sided version of the Gärtner-Ellis theorem
from large deviations. It is stated and proved in Section V-C and it may be of independent interest. Finally,
the important special case when $D = D_{\min}$ is analyzed using results about the recurrence properties of
random walks with stationary increments.



II. Characterization of the rate function 

Let $S$ be the source alphabet with its associated $\sigma$-algebra $\mathcal{S}$, let $(T, \mathcal{T})$ be the reproduction alphabet,
and take $\rho : S \times T \to [0, \infty)$ to be a distortion measure. We only assume that $(S, \mathcal{S})$ and $(T, \mathcal{T})$ are Borel
spaces² and that $\rho$ is $\sigma(\mathcal{S} \times \mathcal{T})$-measurable. Henceforth, these $\sigma$-algebras and the various product $\sigma$-algebras
derived from them are understood from the context. We use the abbreviations r.v., a.s., i.o., l.sc., u.sc. and
log for random variable, almost surely, infinitely often, lower semicontinuous, upper semicontinuous and
$\log_e$, respectively. If $U$ and $V$ are r.v.'s and $g(u) := Ef(u,V)$, we use the notation $E_V f(U,V)$ for the
r.v. $g(U)$. When $U$ and $V$ are independent, then $E_V f(U,V) = E[f(U,V) \mid U]$.

We write $X$ and $Y$ for two independent r.v.'s taking values in $S$ and $T$, respectively, with $X \sim P$ and
$Y \sim Q$. We use $\rho$ to define a sequence of single-letter distortion measures $\rho_n$ on $S^n \times T^n$, $n \ge 1$, by

$$\rho_n(x_1^n, y_1^n) := \frac{1}{n}\sum_{k=1}^{n} \rho(x_k, y_k)$$

where $x_i^j := (x_i, \ldots, x_j)$. The dependence on $\rho$ or $\rho_n$ is suppressed in nearly all of our notation. We use

$$B_n(x_1^n, D) := \{y_1^n \in T^n : \rho_n(x_1^n, y_1^n) \le D\}$$

to denote the distortion ball of radius $D$ around $x_1^n$.

If $W$ is a probability distribution on $S \times T$, then we use $W_S$ to denote the marginal distribution of $W$
on $S$, and similarly for $W_T$. An important subset of probability distributions on $S \times T$ is

$$W(P,D) := \{W : W_S = P,\ E_{(U,V) \sim W}\,\rho(U,V) \le D\}.$$

This subset comes up in the definition of the rate-distortion function

$$R(P,D) := \inf_{W \in W(P,D)} H(W\|W_S \times W_T)$$

which we take to be $+\infty$ when $W(P,D)$ is empty. $H(\mu\|\nu)$ denotes the relative entropy (in nats),

$$H(\mu\|\nu) := \begin{cases} E_\mu \log \dfrac{d\mu}{d\nu} & \text{if } \mu \ll \nu, \\ \infty & \text{otherwise.} \end{cases}$$

Note that $H(W\|W_S \times W_T)$ is the mutual information between r.v.'s $(U,V)$ with joint distribution $W$.

Since $H(W\|W_S \times W_T) = \inf_Q H(W\|W_S \times Q)$, analysis of $R(P,D)$ often proceeds by expanding the
infimum into two parts, namely,

$$R(P,D) = \inf_{Q} R(P,Q,D)$$

$$R(P,Q,D) := \inf_{W \in W(P,D)} H(W\|P \times Q).$$



²Borel spaces include $\mathbb{R}^d$ as well as a large class of infinite-dimensional spaces, including Polish spaces. This assumption is made so that
we can avoid certain pathologies while working with random sequences and conditional distributions [10].

The first infimum is over all probability distributions $Q$ on $T$. Expanding the definition in this way is
convenient, because $R(P,Q,D)$ can be expressed as a simple Fenchel-Legendre transform. In particular,
define

$$\Lambda(P,Q,\lambda) := E_X\left[\log E_Y e^{\lambda \rho(X,Y)}\right]$$

$$\Lambda^*(P,Q,D) := \sup_{\lambda \le 0}\left[\lambda D - \Lambda(P,Q,\lambda)\right].$$

Proposition 1: $R(P,Q,D) = \Lambda^*(P,Q,D)$ for all $D$. If $W(P,D)$ is not empty, then this set contains a
$W$ such that $R(P,Q,D) = H(W\|P \times Q)$.

This alternative characterization is well known (see [5] for a review and references). We state it as
a proposition and prove it below because typically it is qualified by other assumptions on $\rho$ and $D$. In
particular, the case $D = D_{\min}(P,Q)$ is almost always excluded, where

$$D_{\min}(P,Q) := \inf\{D : R(P,Q,D) < \infty\}.$$

$R(P,Q,D)$ has two other important characterizations that arise in a variety of contexts. Let $P_{x_1^n}$ denote
the empirical distribution on $S$ of $x_1^n$, let $Q^n$ denote the $n$-times product measure of $Q$ on $T^n$ and define

$$L_n(x_1^n, Q_n, D) := -\frac{1}{n}\log Q_n(B_n(x_1^n, D))$$

for any probability distribution $Q_n$ on $T^n$.

Theorem 2: If $(X_n)_{n \ge 1}$ is stationary and ergodic, taking values in $S$, with $X_1 \sim P$, then

$$\liminf_{n \to \infty} L_n(X_1^n, Q^n, D) = R(P,Q,D)$$

for all $D$. The result also holds with $L_n(X_1^n, Q^n, D)$ replaced by $R(P_{X_1^n}, Q, D)$.

Of course, if the limit exists, then the lim inf is also the limit and Theorem 2 is what Dembo and
Kontoyiannis [5] call the generalized AEP. There are, however, pathological situations where the limit
does not exist. In the next section we give necessary and sufficient conditions for the existence of the
limit and we analyze in detail the situation where the limit does not exist.
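
As a numerical illustration of Theorem 2 and Proposition 1, the following sketch compares $L_n$ with the dual expression $\Lambda^*$ on a small finite example. Everything specific here is an assumption made for illustration and is not taken from the paper: binary alphabets, Hamming distortion, $P = (1/2, 1/2)$, $Q = (0.8, 0.2)$, $D = 0.2$, and a fixed string whose empirical distribution is exactly $P$. The helper names `lambda_star` and `match_prob` are hypothetical.

```python
import numpy as np

def lambda_star(P, Q, rho, D, lam_grid):
    """Grid approximation (a lower bound) of sup_{lam<=0} [lam*D - Lambda(P,Q,lam)]."""
    # Lambda(lam) = sum_x P(x) * log sum_y Q(y) * exp(lam * rho(x, y))
    inner = np.log(np.exp(lam_grid[:, None, None] * rho[None, :, :]) @ Q)
    Lam = inner @ P
    return float(np.max(lam_grid * D - Lam))

def match_prob(x, Q, rho, D):
    """Exact Q^n(B_n(x, D)) for integer-valued rho, by convolving per-letter distortion pmfs."""
    pmf = np.array([1.0])
    max_d = int(rho.max())
    for xk in x:
        step = np.zeros(max_d + 1)
        for y, q in enumerate(Q):
            step[int(rho[xk, y])] += q
        pmf = np.convolve(pmf, step)          # pmf of the total distortion so far
    budget = int(np.floor(len(x) * D + 1e-12))  # total distortion allowed: n * D
    return float(pmf[: budget + 1].sum())

P = np.array([0.5, 0.5])
Q = np.array([0.8, 0.2])
rho = np.array([[0.0, 1.0], [1.0, 0.0]])      # Hamming distortion
D = 0.2
n = 400
x = np.array([0, 1] * (n // 2))               # empirical distribution equals P

R = lambda_star(P, Q, rho, D, np.linspace(-40.0, 0.0, 40001))
L_n = -np.log(match_prob(x, Q, rho, D)) / n
gap = L_n - R
print(R, L_n, gap)
```

In this setup the gap $L_n - R$ is nonnegative (this is the Chernoff bound used later in Lemma 12) and shrinks as $n$ grows, which is the behavior the generalized AEP describes away from the pathological case.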

III. The Generalized AEP 

Here and in the remainder of the paper we will always assume that $(X_n)_{n \ge 1}$ is stationary and ergodic,
taking values in $S$, with $X_1 \sim P$. Define³

$$\rho_Q(x) := \operatorname{ess\,inf} \rho(x, Y).$$

We can exactly characterize when the lim inf is actually a limit in Theorem 2.

Theorem 3: $\lim_n L_n(X_1^n, Q^n, D)$ does not exist with positive probability if and only if $0 < D =
D_{\min}(P,Q) < \infty$ and $R(P,Q,D) < \infty$ and $\rho_Q(X_1)$ is not a.s. constant. Furthermore, in this situation

$$\mathrm{Prob}\{L_n(X_1^n, Q^n, D) = \infty \text{ i.o.}\} > 0 \tag{4a}$$
$$\mathrm{Prob}\{L_n(X_1^n, Q^n, D) < \infty \text{ i.o.}\} = 1 \tag{4b}$$
$$\lim_{m \to \infty} L_{N_m}(X_1^{N_m}, Q^{N_m}, D) = R(P,Q,D) \tag{4c}$$

where $(N_m)_{m \ge 1}$ is the (a.s.) infinite random subsequence of $(n)_{n \ge 1}$ for which $L_n(X_1^n, Q^n, D)$ is finite.
All of the above also holds with $L_n(X_1^n, Q^n, D)$ replaced by $R(P_{X_1^n}, Q, D)$.

³The essential infimum of a random variable $\eta$ is $\operatorname{ess\,inf} \eta := \inf\{r : \mathrm{Prob}\{\eta < r\} > 0\}$.
Combined with Theorem 2, this gives necessary and sufficient conditions for the generalized AEP. Both
theorems are proven below. The proof shows that $(N_m)_{m \ge 1}$ can also be (a.s.) characterized as the random
subsequence for which

$$\frac{1}{n}\sum_{k=1}^{n} \rho_Q(X_k) \le D. \tag{5}$$

Note that $D_{\min}(P,Q) = E\rho_Q(X_1)$, whenever the former is finite.

A simple example that illustrates the pathology is the following: Let $(X_n)_{n \ge 1}$ be the sequence $1, 0, 1, 0, \ldots$
with probability 1/2 and the sequence $0, 1, 0, 1, \ldots$ with probability 1/2, namely, the binary, stationary,
periodic Markov chain (which is ergodic). Let $Q$ be the point mass at 0, let $\rho(x,y) := |x - y|$ and let
$D = 1/2$. Note that $\rho_Q(X_1) = X_1$ is not constant, that $D = D_{\min}(P,Q) = 1/2$ and that $R(P,Q,D) = 0$
is finite. In the case when $X_1 = 0$, $L_n(X_1^n, Q^n, D) = 0$ for all $n$. In the case when $X_1 = 1$, however,
$L_{2n}(X_1^{2n}, Q^{2n}, D) = 0$ and $L_{2n-1}(X_1^{2n-1}, Q^{2n-1}, D) = \infty$ for all $n$.
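
The example can be checked mechanically. The sketch below is only an illustration; the function names are hypothetical. It evaluates $L_n$ for both realizations of the periodic chain, using the fact that when $Q$ is the point mass at 0 the only codeword with positive probability is the all-zeros string.

```python
import math

def L_n(x, D_num, D_den):
    """L_n(x_1^n, Q^n, D) for Q = point mass at 0 and rho(x, y) = |x - y|.

    Q^n(B_n(x, D)) is 1 if the average of x is at most D = D_num / D_den,
    and 0 otherwise, so L_n is either 0 or +infinity.
    """
    n = len(x)
    in_ball = sum(x) * D_den <= D_num * n   # exact integer comparison, no rounding
    return 0.0 if in_ball else math.inf

def periodic(first, n):
    """First n symbols of the alternating sequence starting at `first`."""
    return [(first + k) % 2 for k in range(n)]

start0 = [L_n(periodic(0, n), 1, 2) for n in range(1, 9)]   # X_1 = 0
start1 = [L_n(periodic(1, n), 1, 2) for n in range(1, 9)]   # X_1 = 1
print(start0)
print(start1)
```

When $X_1 = 1$ the values alternate between $\infty$ (odd $n$) and 0 (even $n$), so the limit does not exist, while the lim inf is still $R(P,Q,D) = 0$.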

IV. Extensions to the case with memory 

Although the source $(X_n)_{n \ge 1}$ can have memory, the generalized AEP stated thus far is restricted to the
case where the reproduction distribution is memoryless, that is, $L_n$ is evaluated with a product measure
$Q^n$. We relax this assumption here.

Let $\mathbb{P}$ denote the distribution of $(X_n)_{n \ge 1}$, which we continue to assume is stationary and ergodic with
$X_1 \sim P$. Let $\mathbb{Q}$ denote the distribution of a stationary random process $(Y_n)_{n \ge 1}$ taking values in $T$ with
$Y_1 \sim Q$. We use $P_n$ and $Q_n$ to denote the distributions of $X_1^n$ and $Y_1^n$, respectively, which are assumed
to be independent. The results stated so far assume that $\mathbb{Q}$ is memoryless, that is, $Q_n = Q^n$.

For the results in this section, however, we assume that $\mathbb{Q}$ satisfies the following strong mixing condition:

$$C^{-1}\,\mathbb{Q}(A)\,\mathbb{Q}(B) \le \mathbb{Q}(A \cap B) \le C\,\mathbb{Q}(A)\,\mathbb{Q}(B)$$

for some fixed $1 \le C < \infty$ and any $A \in \sigma(Y_1^n)$ and $B \in \sigma(Y_{n+1}^{\infty})$ and any $n$. Notice that this implies
ergodicity and includes the cases where $\mathbb{Q}$ is memoryless ($C = 1$) and where $\mathbb{Q}$ is a hidden Markov
model (HMM) whose underlying Markov chain has a finite state space with all (strictly) positive transition
probabilities. For the special case of a finite state Markov chain, a formula for $R_\infty(\mathbb{P}, \mathbb{Q}, D)$ not involving
limits was identified in [18].

Following the definition of $R(P,Q,D)$, define

$$R_n(P_n, Q_n, D) := \frac{1}{n} \inf_{W_n \in W_n(P_n, D)} H(W_n\|P_n \times Q_n)$$

where $W_n(P_n, D)$ is the subset of probability distributions on $S^n \times T^n$ defined analogously to $W(P,D)$
except with $\rho_n$ instead of $\rho$. Also, let $\delta_{x_1^n}$ be the probability distribution on $S^n$ that assigns probability
one to the sequence $x_1^n$.

Theorem 4: Theorems 2 and 3 remain valid when $Q^n$ is replaced by $Q_n$, $R(P_{X_1^n}, Q, D)$ is replaced by
$R_n(\delta_{X_1^n}, Q_n, D)$ and $R(P,Q,D)$ is replaced by $R_\infty(\mathbb{P}, \mathbb{Q}, D)$, where

$$R_\infty(\mathbb{P}, \mathbb{Q}, D) := \lim_{n \to \infty} R_n(P_n, Q_n, D).$$

The existence of the limit in the definition of $R_\infty(\mathbb{P}, \mathbb{Q}, D)$ is part of the result. Define

$$D_{\min}(\mathbb{P}, \mathbb{Q}) := \inf\{D : R_\infty(\mathbb{P}, \mathbb{Q}, D) < \infty\}.$$

Note that the mixing conditions here are strong enough to ensure that

$$D_{\min}(\mathbb{P}, \mathbb{Q}) = D_{\min}(P, Q) \tag{6}$$

and that

$$\operatorname{ess\,inf} \rho_n(x_1^n, Y_1^n) = \frac{1}{n}\sum_{k=1}^{n} \rho_Q(x_k) \tag{7}$$

which is why the results for memory can still be stated in terms of $D_{\min}(P,Q)$ and $\rho_Q$. Extending Theorem 3
to situations where these do not hold seems difficult. The generalized AEP for $\mathbb{Q}$ with memory can also
be found in [2], [3], [5] under more general mixing conditions but for bounded distortion measure $\rho$ and
for $D \ne D_{\min}(P,Q)$.
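Define

$$\Lambda_n(P_n, Q_n, \lambda) := E_{X_1^n}\left[\log E_{Y_1^n} e^{\lambda \rho_n(X_1^n, Y_1^n)}\right]$$

$$\Lambda_n^*(P_n, Q_n, D) := \frac{1}{n} \sup_{\lambda \le 0}\left[\lambda D - \Lambda_n(P_n, Q_n, \lambda)\right].$$

Proposition 1 immediately gives

$$R_n(P_n, Q_n, D) = \Lambda_n^*(P_n, Q_n, D)$$

so $R_\infty(\mathbb{P}, \mathbb{Q}, D)$ is the limit of a sequence of Fenchel-Legendre transforms. Analogous to the memoryless
case, it can also be characterized directly as a Fenchel-Legendre transform.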

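Proposition 5: Define

$$\Lambda_\infty(\mathbb{P}, \mathbb{Q}, \lambda) := \lim_{n \to \infty} \frac{1}{n} \Lambda_n(P_n, Q_n, n\lambda)$$

$$\Lambda_\infty^*(\mathbb{P}, \mathbb{Q}, D) := \sup_{\lambda \le 0}\left[\lambda D - \Lambda_\infty(\mathbb{P}, \mathbb{Q}, \lambda)\right].$$

Then $R_\infty(\mathbb{P}, \mathbb{Q}, D) = \Lambda_\infty^*(\mathbb{P}, \mathbb{Q}, D)$.

The existence of the limit in the definition of $\Lambda_\infty(\mathbb{P}, \mathbb{Q}, \lambda)$ is part of the result. Occasionally it is more
convenient to rewrite

$$\Lambda_n^*(P_n, Q_n, D) = \sup_{\lambda \le 0}\left[\lambda D - \frac{1}{n}\Lambda_n(P_n, Q_n, n\lambda)\right]. \tag{8}$$

This form makes it easy to show that $R_n(P_n, Q^n, D) = R(P,Q,D)$ and that $R_n(\delta_{x_1^n}, Q^n, D) = R(P_{x_1^n}, Q, D)$,
so that whenever $\mathbb{Q}$ is memoryless, $R_\infty(\mathbb{P}, \mathbb{Q}, D) = R(P,Q,D)$ and all the results coincide.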

V. Proofs 

The proofs occasionally refer to $D_{\mathrm{ave}}(P,Q) := E\rho(X,Y)$ for independent $X \sim P$ and $Y \sim Q$.

A. Properties of $\Lambda$ and $\Lambda^*$ for arbitrary distortion measures

A common assumption in the literature is that $\rho$ is either bounded or satisfies some moment conditions,
such as $D_{\mathrm{ave}}(P,Q) < \infty$. Since we do not assume these things here, we need to reverify many properties
of $\Lambda$ and $\Lambda^*$ that can be found elsewhere under stronger conditions. These properties lead to the generalized
AEP under the usual condition that $D \ne D_{\min}$. More detailed proofs, including measurability issues, can
be found in a technical report that preceded this paper [8].

In this section we will use the assumptions and notation from Section II; however, we will suppress the
dependence on $P$ and $Q$ whenever possible. In particular, we will think about $\Lambda(\lambda) := \Lambda(P,Q,\lambda)$ and
$\Lambda^*(D) := \Lambda^*(P,Q,D)$ as functions of $\lambda$ and $D$, respectively. It is also convenient to temporarily redefine

$$D_{\min} := \inf\{D : \Lambda^*(D) < \infty\}$$

until the end of this section where we prove Proposition 1. Proposition 1 shows that $\Lambda^*(D) = R(P,Q,D)$,
so both definitions of $D_{\min}$ are equivalent. Note that everything in this section applies equally well to $\Lambda_n$,
$\Lambda_n^*$ and $R_n$ as defined in Section IV.

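We begin with the following Lemma which comes mostly from [6][Lem. 2.2.5, Ex. 2.2.24]. See also
[5], [19].

Lemma 6: [6] Let $Z$ be a real-valued, nonnegative random variable. Define

$$\Gamma(\lambda) := \log E e^{\lambda Z}.$$

$\Gamma$ is nondecreasing and convex. $\Gamma$ is finite, nonpositive and $C^\infty$ on $(-\infty, 0)$ with

$$\lim_{\lambda \uparrow 0} \Gamma(\lambda) = \Gamma(0) = 0 \quad \text{and} \quad \Gamma'(\lambda) = \frac{E Z e^{\lambda Z}}{E e^{\lambda Z}}, \quad \lambda < 0.$$

$\Gamma'$ is finite, nonnegative and nondecreasing on $(-\infty, 0)$ with

$$\lim_{\lambda \downarrow -\infty} \Gamma'(\lambda) = \operatorname{ess\,inf} Z \quad \text{and} \quad \lim_{\lambda \uparrow 0} \Gamma'(\lambda) = EZ.$$

If $\operatorname{ess\,inf} Z < EZ$, then $\Gamma$ is strictly convex on $(-\infty, 0)$.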
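
The limits in Lemma 6 are easy to see numerically for a discrete random variable. The following is a minimal sketch; the three-point distribution of $Z$ is a hypothetical choice made only for illustration, with $\operatorname{ess\,inf} Z = 1$ and $EZ = 2.7$.

```python
import numpy as np

z = np.array([1.0, 2.0, 5.0])   # hypothetical support of Z
p = np.array([0.2, 0.5, 0.3])   # hypothetical probabilities

def gamma(lam):
    """Gamma(lam) = log E exp(lam * Z)."""
    return float(np.log(np.sum(p * np.exp(lam * z))))

def gamma_prime(lam):
    """Gamma'(lam) = E[Z exp(lam Z)] / E[exp(lam Z)], i.e. the mean of Z under exponential tilting."""
    w = p * np.exp(lam * (z - z.min()))   # common factor exp(lam * min z) cancels in the ratio
    return float(np.sum(z * w) / np.sum(w))

lams = np.linspace(-60.0, -1e-6, 2001)
slopes = np.array([gamma_prime(l) for l in lams])

print(slopes[0])    # close to ess inf Z = 1
print(slopes[-1])   # close to E Z = 2.7
```

The derivative is the mean of $Z$ under an exponentially tilted distribution, which is why it moves monotonically from the essential infimum to the mean as $\lambda$ increases to 0.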

Define $\Gamma(\lambda, x) := \log E e^{\lambda \rho(x, Y)}$. For fixed $x$, we can apply Lemma 6 to the r.v. $Z := \rho(x, Y)$ to
get several regularity properties of $\Gamma(\cdot, x)$. It turns out that these regularity properties are preserved by
expectations, i.e., they continue to hold for $\Lambda(\lambda) = E\Gamma(\lambda, X)$. A sufficient condition is that $\Lambda$ be finite
on $(-\infty, 0]$. This replaces the typical moment conditions on $\rho$. Note that if $\Lambda^*(D)$ is finite for some $D$,
i.e., if $D_{\min}$ is finite, then this condition is trivially satisfied.

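Lemma 7: $\Lambda$ is nondecreasing and convex. Suppose $\Lambda$ is finite on $(-\infty, 0]$. Then $\Lambda$ is nonpositive and
$C^1$ on $(-\infty, 0)$ with $\lim_{\lambda \uparrow 0} \Lambda(\lambda) = \Lambda(0) = 0$ and

$$\Lambda'(\lambda) = E_X\left[\frac{E_Y\,\rho(X,Y)\,e^{\lambda \rho(X,Y)}}{E_Y\,e^{\lambda \rho(X,Y)}}\right], \quad \lambda < 0.$$

$\Lambda'$ is finite, nonnegative and nondecreasing on $(-\infty, 0)$ with

$$\lim_{\lambda \downarrow -\infty} \Lambda'(\lambda) = E\rho_Q(X) \quad \text{and} \quad \lim_{\lambda \uparrow 0} \Lambda'(\lambda) = D_{\mathrm{ave}}.$$

If $E\rho_Q(X) < D_{\mathrm{ave}}$, then $\Lambda$ is strictly convex on $(-\infty, 0)$.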

Proof: The statements about $\Lambda$ are trivial. We will focus on the properties of $\Lambda'$ which follow more
or less immediately from the convexity of $\Lambda$ and the differentiability of $\Gamma(\cdot, x)$. Let $\Lambda'_-$ and $\Lambda'_+$ be the left
hand and right hand derivatives of $\Lambda$, respectively, which are finite for $\lambda < 0$. The monotone convergence
theorem immediately gives $\Lambda'_-(\lambda) = E\Gamma'(\lambda, X)$ for $\lambda < 0$. (The same argument can be used as $\lambda \uparrow 0$.)
This shows that $\Gamma'(\lambda, X)$ has finite expectation and lets us use the dominated convergence theorem to
get that $\Lambda'_+(\lambda) = E\Gamma'(\lambda, X)$. (The same argument can be used as $\lambda \downarrow -\infty$.) So the left and right hand
derivatives of $\Lambda$ are identical and have the given form. Recall that a differentiable, convex function has a
continuous derivative. ■

These properties of $\Lambda$ give the following well known properties of $\Lambda^*$, which we state without proof,
except for (9). See [6][Lem. 2.2.5] and [14][Thm. 23.5, Cor. 23.5.1, Thm. 25.1].

Lemma 8: $\Lambda^*$ is convex, l.sc., nonnegative, nonincreasing and continuous from the right. $\Lambda^* = \infty$ on
$(-\infty, D_{\min})$ and $\Lambda^* = 0$ on $[D_{\mathrm{ave}}, \infty)$. If $D \le D_{\mathrm{ave}}$, then $\Lambda^*(D) = \sup_{\lambda \in \mathbb{R}}[\lambda D - \Lambda(\lambda)]$. If $D_{\min} < \infty$
(so that Lemma 7 applies), then $D_{\min} = E\rho_Q(X)$, $\Lambda^*$ is finite and $C^1$ on $(D_{\min}, \infty)$ and

$$\Lambda^*(D_{\min}) = E_X\left[-\log E_Y 1\{\rho(X,Y) = \rho_Q(X)\}\right]. \tag{9}$$

If further $D_{\min} < D_{\mathrm{ave}}$, then $\Lambda^*$ is strictly convex (and thus strictly decreasing) on $(D_{\min}, D_{\mathrm{ave}})$ and for
each $D \in (D_{\min}, D_{\mathrm{ave}})$ there exists a unique $\lambda_D < 0$ such that $\Lambda^*(D) = \lambda_D D - \Lambda(\lambda_D)$.

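Proof: We only prove (9). Define

$$\tilde\rho(x, y) := \max\{\rho(x, y) - \rho_Q(x), 0\}$$

so that $\tilde\rho$ is a valid distortion measure and so that

$$\rho(x, Y) = \tilde\rho(x, Y) + \rho_Q(x).$$

Let $\tilde\Lambda$ be defined analogously to $\Lambda$, except with $\tilde\rho$ instead of $\rho$. We have $\Lambda(\lambda) = \tilde\Lambda(\lambda) + \lambda D_{\min}$ so that

$$\begin{aligned}
\Lambda^*(D_{\min}) &= \sup_{\lambda \le 0}\left[\lambda D_{\min} - \tilde\Lambda(\lambda) - \lambda D_{\min}\right] = \lim_{\lambda \downarrow -\infty} -\tilde\Lambda(\lambda) \\
&= \lim_{\lambda \downarrow -\infty} E_X\left[-\log E_Y e^{\lambda \tilde\rho(X,Y)}\right] \\
&= E_X\left[-\log E_Y\left(\lim_{\lambda \downarrow -\infty} e^{\lambda \tilde\rho(X,Y)}\right)\right] \\
&= E_X\left[-\log E_Y 1\{\tilde\rho(X,Y) = 0\}\right] \\
&= E_X\left[-\log E_Y 1\{\rho(X,Y) = \rho_Q(X)\}\right].
\end{aligned}$$

We moved the limit inside the expectations using first the monotone convergence theorem and then the
dominated convergence theorem. ■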
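
Formula (9) can be sanity-checked on finite alphabets, where $\rho_Q(x)$ is a row minimum and the indicator picks out the minimizing letters. The sketch below uses hypothetical $P$, $Q$ and $\rho$ chosen for illustration (with $Q$ of full support), and evaluates the dual objective at a very negative $\lambda$ as a stand-in for the limit $\lambda \downarrow -\infty$.

```python
import numpy as np

P = np.array([0.3, 0.7])                 # hypothetical source distribution
Q = np.array([0.5, 0.25, 0.25])          # hypothetical codebook distribution (full support)
rho = np.array([[1.0, 1.0, 3.0],         # rho(x, y); row x = 0 attains its minimum at two letters
                [2.0, 0.0, 4.0]])

rho_Q = rho.min(axis=1)                  # ess inf of rho(x, Y) under Q
D_min = float(P @ rho_Q)                 # E rho_Q(X)

def dual_objective(lam, D):
    """lam * D - Lambda(lam), with exp(lam * rho_Q(x)) factored out for numerical stability."""
    shifted = np.log(np.exp(lam * (rho - rho_Q[:, None])) @ Q)
    Lam = float(P @ (lam * rho_Q + shifted))
    return lam * D - Lam

# Right-hand side of (9): -E_X log Q{y : rho(X, y) = rho_Q(X)}
closed_form = float(-P @ np.log((rho == rho_Q[:, None]).astype(float) @ Q))
numeric = dual_objective(-200.0, D_min)
print(D_min, closed_form, numeric)
```

At $D = D_{\min}$ the objective is monotone in $\lambda$, so the supremum is approached only in the limit; this is the reason the boundary case needs a separate argument.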
1) Proposition 1: Proposition 1 is an immediate consequence of the next two lemmas. The proofs follow
[5][Thm. 2] with minor modifications. Note that Proposition 1 and Lemma 8 imply that $D_{\min} = E\rho_Q(X)$
whenever the former is finite.

Lemma 9: If $W \in W(P,D)$, then $H(W\|P \times Q) \ge \Lambda^*(D)$.

Proof: Let $\psi : T \to (-\infty, 0]$ be measurable. Then [5]

$$H(Q'\|Q) \ge E_{Y' \sim Q'}\,\psi(Y') - \log E e^{\psi(Y)}$$

for any probability measure $Q'$ on $T$. Applying the previous inequality with $\psi(y) := \lambda \rho(x, y)$, for $\lambda \le 0$,
gives

$$H(W(\cdot|x)\|Q) \ge \lambda E_{V \sim W(\cdot|x)}\,\rho(x, V) - \log E e^{\lambda \rho(x, Y)}$$

where $W(\cdot|x)$ denotes the regular conditional distribution of $V$ given $U = x$ for $(U,V) \sim W$. Taking
expectations w.r.t. $U$ and noting that $W \in W(P,D)$ gives

$$H(W\|P \times Q) = E_{U \sim P}\,H(W(\cdot|U)\|Q) \ge \lambda D - \Lambda(\lambda).$$

Optimizing over $\lambda \le 0$ completes the proof. ■
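
The variational inequality at the start of this proof can be spot-checked on a finite alphabet. The following sketch is an illustration only: the five-letter alphabet, the random draws and the names `Q_alt` and `tilted` are assumptions, not part of the paper. It also checks the standard fact that the bound is attained when the alternative measure is the exponential tilt of $Q$ by $\psi$.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_entropy(mu, nu):
    """H(mu || nu) in nats for finite distributions with nu > 0."""
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / nu[mask])))

Q = rng.dirichlet(np.ones(5))            # reference measure
gaps = []
for _ in range(1000):
    Q_alt = rng.dirichlet(np.ones(5))    # arbitrary alternative measure Q'
    psi = -rng.exponential(size=5)       # arbitrary test function with psi <= 0
    lower = float(Q_alt @ psi - np.log(Q @ np.exp(psi)))
    gaps.append(relative_entropy(Q_alt, Q) - lower)

# Equality case: Q' proportional to Q * exp(psi)
psi = -rng.exponential(size=5)
tilted = Q * np.exp(psi)
tilted /= tilted.sum()
tight_gap = relative_entropy(tilted, Q) - float(tilted @ psi - np.log(Q @ np.exp(psi)))
print(min(gaps), tight_gap)
```

The equality case is what makes the tilted distribution in Lemma 10 the natural candidate for the minimizer.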

Lemma 10: If $\Lambda^*(D) < \infty$, then there exists a $W \in W(P,D)$ with $H(W\|P \times Q) = \Lambda^*(D)$.

Proof: The proof makes frequent use of Lemma 8. If $D \ge D_{\mathrm{ave}}$, then $\Lambda^*(D) = 0$ and $W := P \times Q$
achieves the equality. If $D_{\min} < D < D_{\mathrm{ave}}$, then $W$ defined by

$$\frac{dW}{d(P \times Q)}(x, y) := \frac{e^{\lambda_D \rho(x, y)}}{E e^{\lambda_D \rho(x, Y)}}$$

achieves the equality [5], where $\lambda_D$ is uniquely chosen so that $\Lambda^*(D) = \lambda_D D - \Lambda(\lambda_D)$.
Finally, if $D = D_{\min} = E\rho_Q(X)$, then define $W$ by

$$\frac{dW}{d(P \times Q)}(x, y) := \frac{1\{y \in A(x)\}}{E\,1\{Y \in A(x)\}}$$

where $A(x) = \{y : \rho(x, y) = \rho_Q(x)\}$. Note that Lemma 8 shows that $\Lambda^*(D) = E_X[-\log E_Y 1\{Y \in A(X)\}]$,
which we have assumed is finite, so the denominator is positive $P$-a.s. and $W$ is well-defined. It is easy
to see that $W \in W(P,D)$ and that

$$\begin{aligned}
H(W\|P \times Q) &= E\left[\frac{dW}{d(P \times Q)}(X,Y)\,\log \frac{dW}{d(P \times Q)}(X,Y)\right] \\
&= E\left[\frac{1\{Y \in A(X)\}}{E_Y[1\{Y \in A(X)\}]}\,\log 1\{Y \in A(X)\}\right] - E\left[\frac{1\{Y \in A(X)\}}{E_Y[1\{Y \in A(X)\}]}\,\log E_Y[1\{Y \in A(X)\}]\right] \\
&= 0 - E_X\left[\log E_Y 1\{Y \in A(X)\}\right] = \Lambda^*(D)
\end{aligned}$$

which completes the proof. ■
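
For the interior case $D_{\min} < D < D_{\mathrm{ave}}$, the tilted distribution can be built explicitly on finite alphabets. The sketch below is a hypothetical example (binary alphabets, Hamming distortion, and parameter values chosen only for illustration); it finds $\lambda_D$ by bisection on $\Lambda'(\lambda) = D$ and then compares $H(W\|P \times Q)$ with $\lambda_D D - \Lambda(\lambda_D)$.

```python
import numpy as np

P = np.array([0.5, 0.5])
Q = np.array([0.8, 0.2])
rho = np.array([[0.0, 1.0], [1.0, 0.0]])   # Hamming distortion
D = 0.2                                    # lies between D_min = 0 and D_ave = 0.5

def tilt(lam):
    """Conditional distributions W(y | x) proportional to Q(y) * exp(lam * rho(x, y))."""
    w = Q[None, :] * np.exp(lam * rho)
    return w / w.sum(axis=1, keepdims=True)

def Lambda(lam):
    return float(P @ np.log(np.exp(lam * rho) @ Q))

def mean_distortion(lam):
    """Expected distortion under the tilted joint distribution; equals Lambda'(lam)."""
    return float(P @ np.sum(tilt(lam) * rho, axis=1))

# Bisection for lam_D with Lambda'(lam_D) = D (Lambda' is nondecreasing on lam < 0).
lo, hi = -50.0, 0.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if mean_distortion(mid) < D:
        lo = mid
    else:
        hi = mid
lam_D = 0.5 * (lo + hi)

W = P[:, None] * tilt(lam_D)               # joint distribution with S-marginal P
PxQ = P[:, None] * Q[None, :]
H = float(np.sum(W * np.log(W / PxQ)))     # relative entropy H(W || P x Q)
dual = lam_D * D - Lambda(lam_D)           # value of the dual objective at lam_D
print(lam_D, H, dual)
```

The two printed values agree up to rounding, and $W$ meets the distortion constraint with equality, which is the content of the middle case of the proof.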



B. Extensions to memory 

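Here we prove Proposition 5 and the claims in the text following Theorem 4, including the existence
of $R_\infty(\mathbb{P}, \mathbb{Q}, D)$, under the assumptions of Section IV. The stationarity and mixing properties of $\mathbb{Q}$ give
$Q^n \ll Q_n \ll Q^n$, which proves (7), and they give

$$C^{-1}\int f(y_1^{n+m})\,Q_m(dy_{n+1}^{n+m})\,Q_n(dy_1^n) \le \int f(y_1^{n+m})\,Q_{n+m}(dy_1^{n+m}) \le C\int f(y_1^{n+m})\,Q_m(dy_{n+1}^{n+m})\,Q_n(dy_1^n) \tag{10}$$

for any function $f \ge 0$. We make use of this property repeatedly. Note that if $f$ factors, i.e., if $f(y_1^{n+m}) =
g(y_1^n)\,h(y_{n+1}^{n+m})$ for $g, h \ge 0$, then (10) becomes

$$C^{-1}\,Eg(Y_1^n)\,Eh(Y_1^m) \le Ef(Y_1^{n+m}) \le C\,Eg(Y_1^n)\,Eh(Y_1^m).$$

This gives

$$C^{-1}\,E e^{n\lambda \rho_n(x_1^n, Y_1^n)}\,E e^{m\lambda \rho_m(x_{n+1}^{n+m}, Y_1^m)} \le E e^{(n+m)\lambda \rho_{n+m}(x_1^{n+m}, Y_1^{n+m})} \le C\,E e^{n\lambda \rho_n(x_1^n, Y_1^n)}\,E e^{m\lambda \rho_m(x_{n+1}^{n+m}, Y_1^m)} \tag{11}$$

which implies that

$$\begin{aligned}
&\Lambda_n(\delta_{x_1^n}, Q_n, n\lambda) + \Lambda_m(\delta_{x_{n+1}^{n+m}}, Q_m, m\lambda) - \log C \\
&\qquad \le \Lambda_{n+m}(\delta_{x_1^{n+m}}, Q_{n+m}, (n+m)\lambda) \\
&\qquad \le \Lambda_n(\delta_{x_1^n}, Q_n, n\lambda) + \Lambda_m(\delta_{x_{n+1}^{n+m}}, Q_m, m\lambda) + \log C.
\end{aligned} \tag{12}$$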
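Replacing $x_k$ with $X_k$ and taking expected values gives

$$\begin{aligned}
&\Lambda_n(P_n, Q_n, n\lambda) + \Lambda_m(P_m, Q_m, m\lambda) - \log C \\
&\qquad \le \Lambda_{n+m}(P_{n+m}, Q_{n+m}, (n+m)\lambda) \\
&\qquad \le \Lambda_n(P_n, Q_n, n\lambda) + \Lambda_m(P_m, Q_m, m\lambda) + \log C.
\end{aligned} \tag{13}$$

This final result implies several things. First, it shows that if $\Lambda_n(P_n, Q_n, n\lambda)$ is finite (infinite) for some
$n$, then it is finite (infinite) for all $n$. It also shows that the sequence $\Lambda_n(P_n, Q_n, n\lambda) + \log C$ is subadditive,
so the limit in the definition of $\Lambda_\infty$ exists. In particular [10][Lemma 10.21],

$$\Lambda_\infty(\mathbb{P}, \mathbb{Q}, \lambda) := \lim_{n \to \infty} \frac{1}{n}\Lambda_n(P_n, Q_n, n\lambda) = \inf_{n \ge N} \frac{1}{n}\left[\Lambda_n(P_n, Q_n, n\lambda) + \log C\right]$$

for any $N > 0$. This gives

$$\begin{aligned}
\Lambda_\infty^*(\mathbb{P}, \mathbb{Q}, D) &= \sup_{\lambda \le 0}\left[\lambda D - \inf_{n \ge N} \frac{1}{n}\left[\Lambda_n(P_n, Q_n, n\lambda) + \log C\right]\right] \\
&= \sup_{n \ge N}\,\sup_{\lambda \le 0}\left[\lambda D - \frac{1}{n}\Lambda_n(P_n, Q_n, n\lambda) - \frac{\log C}{n}\right] \\
&= \sup_{n \ge N}\left[\Lambda_n^*(P_n, Q_n, D) - \frac{\log C}{n}\right].
\end{aligned}$$

The last equality follows from (8), which is easy to prove by moving the $1/n$ outside of the supremum
and optimizing over $n\lambda$ instead of $\lambda$. Since we always have

$$\begin{aligned}
\Lambda_\infty^*(\mathbb{P}, \mathbb{Q}, D) &= \sup_{\lambda \le 0}\,\lim_{n \to \infty}\left[\lambda D - \frac{1}{n}\Lambda_n(P_n, Q_n, n\lambda)\right] \\
&\le \liminf_{n \to \infty}\,\sup_{\lambda \le 0}\left[\lambda D - \frac{1}{n}\Lambda_n(P_n, Q_n, n\lambda)\right] \\
&= \liminf_{n \to \infty} \Lambda_n^*(P_n, Q_n, D)
\end{aligned}$$

we have also shown that

$$\Lambda_\infty^*(\mathbb{P}, \mathbb{Q}, D) = \lim_{n \to \infty} \Lambda_n^*(P_n, Q_n, D) = \lim_{n \to \infty} R_n(P_n, Q_n, D) =: R_\infty(\mathbb{P}, \mathbb{Q}, D).$$

This completes the proof of Proposition 5 and shows that $R_\infty(\mathbb{P}, \mathbb{Q}, D)$ exists.
Lastly, (13) shows that

$$\Lambda(P,Q,\lambda) - \log C \le \frac{1}{n}\Lambda_n(P_n, Q_n, n\lambda) \le \Lambda(P,Q,\lambda) + \log C$$

so $\Lambda^*(P,Q,D) - \log C \le \Lambda_n^*(P_n, Q_n, D) \le \Lambda^*(P,Q,D) + \log C$. This gives (6).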
C. A large deviations result

For appropriate values of $D$, the generalized AEP is essentially a large deviations result. The next
lemma summarizes what we need. It is basically a corollary of the Gärtner-Ellis Theorem. Note that $\Lambda$
and $\Lambda^*$ are redefined in this section.

Lemma 11: Let $(Z_n)_{n \ge 1}$ be a sequence of nonnegative, real-valued random variables such that

$$\Lambda(\lambda) := \lim_{n \to \infty} \frac{1}{n}\log E e^{n\lambda Z_n} \quad \text{exists}$$

for all $\lambda \in \mathbb{R}$. Define $\Lambda^*(D) := \sup_{\lambda \le 0}[\lambda D - \Lambda(\lambda)]$. Then

$$\limsup_{n \to \infty} \frac{1}{n}\log \mathrm{Prob}\{Z_n \le D\} \le -\Lambda^*(D)$$

for all $D$. Furthermore, if $\Lambda^*$ is strictly convex on $(a, b)$, then

$$\lim_{n \to \infty} \frac{1}{n}\log \mathrm{Prob}\{Z_n \le D\} = -\Lambda^*(D)$$

for all $D \in (a, b]$.

Proof: For any $\lambda \le 0$, $\mathrm{Prob}\{Z_n \le D\} \le E e^{n\lambda(Z_n - D)}$, so

$$\limsup_{n \to \infty} \frac{1}{n}\log \mathrm{Prob}\{Z_n \le D\} \le -\lambda D + \limsup_{n \to \infty} \frac{1}{n}\log E e^{n\lambda Z_n} = -[\lambda D - \Lambda(\lambda)].$$

Optimizing over $\lambda \le 0$ gives the upper bound.
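Suppose $\Lambda^*$ is strictly convex on $(a, b)$. Since $\Lambda^*$ is nonnegative and decreasing, $\Lambda^*$ must be finite and
positive on $(a, b)$. The finiteness implies that $\Lambda$ is finite on $(-\infty, 0]$. We will first show that

$$\Lambda^*(D) = \sup_{\lambda \in \mathbb{R}}[\lambda D - \Lambda(\lambda)], \quad D \le b. \tag{14}$$

It is easy to see that $\Lambda$ is increasing and convex with $\Lambda(0) = 0$, so we can choose $a < D' \le \infty$ with
$\Lambda(\lambda) \ge \lambda D'$ for all $\lambda \in \mathbb{R}$. If $D' = \infty$, then $\Lambda(\lambda) = \infty$ for $\lambda > 0$ and (14) holds for all $D$. If $D'$ is finite
and $D \le D'$, then $\lambda D - \Lambda(\lambda) \le \lambda D' - \Lambda(\lambda) \le 0$ for all $\lambda \ge 0$, so (14) holds for all $D \le D'$. The same
inequality gives $\Lambda^*(D') = 0$, so $b \le D'$.

Now we will prove the lower bound. If $\Lambda$ is finite in some neighborhood of zero, then the lemma
follows immediately from the Gärtner-Ellis Theorem as stated in [7][Thm. V.6]. If this is not the case,
then we need to slightly modify the sequence $(Z_n)$ before applying the theorem.

Fix $D \in (a, b]$ and choose $0 < \epsilon < D - a$. Let $(\tilde Z_n)_{n \ge 1}$ be a sequence of nonnegative, real-valued r.v.'s
with distribution $\tilde P_n(\cdot) := \mathrm{Prob}\{\tilde Z_n \in \cdot\}$ defined by

$$\frac{d\tilde P_n}{dP_n}(z) := \frac{e^{-n\epsilon z}}{E e^{-n\epsilon Z_n}}$$

where $P_n(\cdot) := \mathrm{Prob}\{Z_n \in \cdot\}$. We have

$$\begin{aligned}
\log \mathrm{Prob}\{Z_n \le D\} &\ge \log P_n((D - \epsilon, D)) \\
&= \log \int_{D - \epsilon}^{D} \frac{E e^{-n\epsilon Z_n}}{e^{-n\epsilon z}}\,\tilde P_n(dz) \\
&\ge \log E e^{-n\epsilon Z_n} + n\epsilon(D - \epsilon) + \log \tilde P_n((D - \epsilon, D)).
\end{aligned}$$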
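Taking limits gives

$$\liminf_{n \to \infty} \frac{1}{n}\log \mathrm{Prob}\{Z_n \le D\} \ge \Lambda(-\epsilon) + \epsilon D - \epsilon^2 + \liminf_{n \to \infty} \frac{1}{n}\log \tilde P_n((D - \epsilon, D)). \tag{15}$$

We want to apply the Gärtner-Ellis Theorem to the sequence $(\tilde P_n)_{n \ge 1}$. Note that

$$E e^{n\lambda \tilde Z_n} = \int e^{n\lambda z}\,\frac{e^{-n\epsilon z}}{E e^{-n\epsilon Z_n}}\,P_n(dz) = \frac{E e^{n(\lambda - \epsilon)Z_n}}{E e^{-n\epsilon Z_n}}$$

so

$$\tilde\Lambda(\lambda) := \lim_{n \to \infty} \frac{1}{n}\log E e^{n\lambda \tilde Z_n} = \Lambda(\lambda - \epsilon) - \Lambda(-\epsilon)$$

exists and is finite for all $\lambda \le \epsilon$. In particular, it is finite in a neighborhood of 0. Note also that

$$\begin{aligned}
\tilde\Lambda^*(x) &:= \sup_{\lambda \in \mathbb{R}}\left[\lambda x - \tilde\Lambda(\lambda)\right] = \sup_{\lambda \in \mathbb{R}}\left[(\lambda + \epsilon)x - \tilde\Lambda(\lambda + \epsilon)\right] \\
&= \sup_{\lambda \in \mathbb{R}}\left[\lambda x - \Lambda(\lambda)\right] + \epsilon x + \Lambda(-\epsilon) = \Lambda^*(x) + \epsilon x + \Lambda(-\epsilon)
\end{aligned}$$

for any $x \le b$. So $\tilde\Lambda^*$ is also strictly convex on $(a, b)$ and the slope of any supporting line to $\tilde\Lambda^*$ at a point
in $(a, b)$ is strictly less than $\epsilon$. In particular, the slope at such a point is in the interior of the domain
where $\tilde\Lambda$ is finite. So the assumptions of the Gärtner-Ellis Theorem are satisfied and

$$\begin{aligned}
\liminf_{n \to \infty} \frac{1}{n}\log \tilde P_n((D - \epsilon, D)) &\ge -\inf_{x \in (D - \epsilon, D)} \tilde\Lambda^*(x) \\
&\ge -\inf_{x \in (D - \epsilon, D)}\left[\Lambda^*(x) + \epsilon D + \Lambda(-\epsilon)\right] \\
&= -\Lambda^*(D) - \epsilon D - \Lambda(-\epsilon).
\end{aligned}$$

Combining this with (15) gives

$$\liminf_{n \to \infty} \frac{1}{n}\log \mathrm{Prob}\{Z_n \le D\} \ge -\Lambda^*(D) - \epsilon^2.$$

Since $\epsilon$ was arbitrary, this completes the proof. ■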

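Lemma 12: Let $Z$ be a real-valued, nonnegative random variable. Define $\Lambda^*(D) := \sup_{\lambda \le 0}[\lambda D -
\log E e^{\lambda Z}]$. Then

$$\log \mathrm{Prob}\{Z \le D\} \le -\Lambda^*(D)$$

with equality for $D \le \operatorname{ess\,inf} Z$. Furthermore, $\log \mathrm{Prob}\{Z \le D\}$ is finite if and only if $-\Lambda^*(D)$ is finite.

Proof: For any $\lambda \le 0$, $\log \mathrm{Prob}\{Z \le D\} \le -[\lambda D - \log E e^{\lambda Z}]$. Optimizing over $\lambda \le 0$ gives the
first bound. Suppose $D \le \operatorname{ess\,inf} Z$ so that $Z - D \ge 0$ a.s. In this case

$$\mathrm{Prob}\{Z \le D\} = \mathrm{Prob}\{Z = D\} = \lim_{\lambda \to -\infty} E e^{\lambda(Z - D)} = \inf_{\lambda \le 0} E e^{\lambda(Z - D)}$$

and

$$\log \mathrm{Prob}\{Z \le D\} = \inf_{\lambda \le 0}\left[\log E e^{\lambda Z} - \lambda D\right] = -\Lambda^*(D).$$

Of course, if $D > \operatorname{ess\,inf} Z$, then $-\infty < \log \mathrm{Prob}\{Z \le D\} \le -\Lambda^*(D) \le 0$, and everything is finite. ■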
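
The bound in Lemma 12 is a Chernoff bound restricted to $\lambda \le 0$, and it can be checked directly for a discrete random variable. The sketch below uses a hypothetical three-point distribution and approximates the supremum on a finite grid of $\lambda$ values, so it only covers points $D$ at or above the essential infimum, where both sides are finite.

```python
import numpy as np

z = np.array([1.0, 2.0, 5.0])   # hypothetical support of Z (ess inf Z = 1)
p = np.array([0.2, 0.5, 0.3])   # hypothetical probabilities

def log_prob_le(D):
    """log Prob{Z <= D}."""
    mass = p[z <= D].sum()
    return float(np.log(mass)) if mass > 0 else -np.inf

def neg_lambda_star(D, lam_grid=np.linspace(-100.0, 0.0, 100001)):
    """-Lambda*(D) = inf over lam <= 0 of log E exp(lam * (Z - D)), approximated on a grid."""
    vals = np.log(np.exp(lam_grid[:, None] * (z[None, :] - D)) @ p)
    return float(vals.min())

for D in (1.0, 2.0, 3.0):
    print(D, log_prob_le(D), neg_lambda_star(D))
```

At $D = \operatorname{ess\,inf} Z = 1$ the two columns coincide, and for larger $D$ the bound is strict in this example; for $D = 3 > EZ$ the right-hand side is 0.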
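Corollary 13: Lemma 11 holds if $\frac{1}{n}\log \mathrm{Prob}\{Z_n \le D\}$ is replaced by $-\Lambda_n^*(D)$, where

$$\Lambda_n^*(D) := \frac{1}{n}\sup_{\lambda \le 0}\left[\lambda D - \log E e^{\lambda Z_n}\right].$$

Proof: $-\Lambda_n^*(D) \le -[n\lambda D - \log E e^{n\lambda Z_n}]/n$. Taking limits and optimizing over $\lambda \le 0$ gives the upper
bound

$$\limsup_{n \to \infty} -\Lambda_n^*(D) \le -\Lambda^*(D).$$

Lemma 12 shows that

$$\liminf_{n \to \infty} -\Lambda_n^*(D) \ge \liminf_{n \to \infty} \frac{1}{n}\log \mathrm{Prob}\{Z_n \le D\},$$

which gives the lower bound in the second part of Lemma 11. ■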
D. The generalized AEP 

Now we will prove the main theorems in the text. We focus on the more general setting with memory
described in Section IV since this includes the memoryless situation as a special case. The main idea is
to fix a typical realization $(x_n)_{n \ge 1}$ of $(X_n)_{n \ge 1}$ and then analyze the behavior of the sequence of r.v.'s
$(Z_n)_{n \ge 1}$, where

$$Z_n := \rho_n(x_1^n, Y_1^n) := \frac{1}{n}\sum_{k=1}^{n} \rho(x_k, Y_k) \tag{16}$$

and where $(Y_n)_{n \ge 1}$ has distribution $\mathbb{Q}$. Using this terminology,

$$L_n(x_1^n, Q_n, D) = -\frac{1}{n}\log \mathrm{Prob}\{Z_n \le D\}$$

and

$$R_n(\delta_{x_1^n}, Q_n, D) = \Lambda_n^*(\delta_{x_1^n}, Q_n, D) = \frac{1}{n}\sup_{\lambda \le 0}\left[\lambda D - \log E e^{\lambda Z_n}\right].$$

The proof proceeds in several stages. Proposition 5 allows us to use $\Lambda_\infty^*(\mathbb{P}, \mathbb{Q}, D)$ instead of $R_\infty(\mathbb{P}, \mathbb{Q}, D)$.
We first prove the lower bound

$$\liminf_{n \to \infty} L_n(X_1^n, Q_n, D) \ge \Lambda_\infty^*(\mathbb{P}, \mathbb{Q}, D) \tag{17}$$

for all $D$. Then we prove the upper bound

$$\limsup_{n \to \infty} L_n(X_1^n, Q_n, D) \le \Lambda_\infty^*(\mathbb{P}, \mathbb{Q}, D) \tag{18}$$

separately for the cases $D < D_{\min}(P,Q)$, $D > D_{\mathrm{ave}}(P,Q)$ and $D_{\min}(P,Q) < D \le D_{\mathrm{ave}}(P,Q)$. The
case $D = D_{\min}(P,Q)$ can be pathological in certain situations. For these situations we characterize the
pathology as described in Theorem 3 (extended to the situation with memory). Note that even in the
pathological situation when the limit does not exist, there is a subsequence along which the upper bound
in (18) holds. This gives Theorem 2 (extended to the situation with memory). Finally, Lemma 12 allows
us to replace $L_n(X_1^n, Q_n, D)$ with $R_n(\delta_{X_1^n}, Q_n, D)$ along the lines of Corollary 13, even in the pathological
situation.
1) The lower bound: (12) shows that we can apply the subadditive ergodic theorem [10, Theorem 10.22] to

\[ \Lambda_n(\delta_{X_1^n}, Q_n, n\lambda) + \log C \]

for $\lambda \le 0$ or to

\[ -\Lambda_n(\delta_{X_1^n}, Q_n, n\lambda) + \log C \]

for $\lambda > 0$ (so that everything is bounded above by $\log C$) to get

\[ \lim_{n\to\infty} \frac{1}{n}\Lambda_n(\delta_{X_1^n}, Q_n, n\lambda) = \Lambda_\infty(\mathbb{P}, \mathbb{Q}, \lambda) \quad \text{a.s.} \tag{19} \]

The right side is a constant because the limit is shift-invariant and the source is ergodic. Since $\Lambda_n$ is increasing in $\lambda$, we can choose the exceptional set independently of $\lambda$.

Choosing $(x_n)_{n\ge 1}$ so that (19) holds and defining $(Z_n)_{n\ge 1}$ as in (16) allows us to apply the first part of Lemma 11 to get the lower bound (17). Note that Corollary 13 gives the same lower bound for $R_n(\delta_{X_1^n}, Q_n, D)$.
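The convergence in (19) is easiest to see in the memoryless special case, where $\Lambda_n(\delta_{x_1^n}, Q_n, n\lambda)$ splits into a sum of single-letter terms and the limit is an ordinary ergodic average. The sketch below assumes binary alphabets, Hamming distortion and Bernoulli marginals, none of which come from the paper; the function names are hypothetical.

```python
import math

def log_mgf_letter(x, q1, lam):
    """Lambda(delta_x, Q, lam) = log E exp(lam * rho(x, Y)), Hamming rho, Y ~ Bernoulli(q1)."""
    p_mismatch = q1 if x == 0 else 1.0 - q1
    return math.log(1.0 - p_mismatch + p_mismatch * math.exp(lam))

def empirical_lambda(xs, q1, lam):
    """(1/n) Lambda_n(delta_{x_1^n}, Q_n, n*lam) when Q_n is a product measure."""
    return sum(log_mgf_letter(x, q1, lam) for x in xs) / len(xs)

def limiting_lambda(p1, q1, lam):
    """Lambda(P, Q, lam) = E_P log E_Q exp(lam * rho(X, Y)) with X ~ Bernoulli(p1)."""
    return (1.0 - p1) * log_mgf_letter(0, q1, lam) + p1 * log_mgf_letter(1, q1, lam)

# A realization whose empirical frequency of ones is exactly p1 = 0.3.
xs = [1] * 30 + [0] * 70
for lam in (-2.0, -1.0, 0.0, 1.0):
    print(lam, empirical_lambda(xs, 0.4, lam), limiting_lambda(0.3, 0.4, lam))
```

The printed values are increasing in $\lambda$, which is the monotonicity used above to choose the exceptional set independently of $\lambda$.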

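2) The upper bound when $D < D_{\min}$ or $D > D_{\mathrm{ave}}$: When $\Lambda_\infty^*(\mathbb{P}, \mathbb{Q}, D) = \infty$, the lower bound (17) implies the upper bound (18). Note that this includes all $D < D_{\min}(\mathbb{P}, \mathbb{Q})$ and possibly some situations where $D = D_{\min}(\mathbb{P}, \mathbb{Q})$.

If $D_{\mathrm{ave}}(\mathbb{P}, \mathbb{Q})$ is finite and $D > D_{\mathrm{ave}}(\mathbb{P}, \mathbb{Q})$, then Chebyshev's inequality and the ergodic theorem give

\begin{align*}
L_n(X_1^n, Q_n, D) &= -\frac{1}{n}\log\left[1 - Q_n\{y_1^n : \rho_n(X_1^n, y_1^n) > D\}\right] \\
&\le -\frac{1}{n}\log\left[1 - \frac{E_{Y_1^n}\rho_n(X_1^n, Y_1^n)}{D}\right] \\
&\to 0 \le \Lambda_\infty^*(\mathbb{P}, \mathbb{Q}, D)
\end{align*}

as $n \to \infty$, since $E_{Y_1^n}\rho_n(X_1^n, Y_1^n) \to D_{\mathrm{ave}}(\mathbb{P}, \mathbb{Q}) < D$. This gives the upper bound (18) for the case $D > D_{\mathrm{ave}}(\mathbb{P}, \mathbb{Q})$.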

3) The upper bound when $D_{\min} < D \le D_{\mathrm{ave}}$: Assume that $D_{\min} := D_{\min}(\mathbb{P}, \mathbb{Q}) < D \le D_{\mathrm{ave}}(\mathbb{P}, \mathbb{Q}) =: D_{\mathrm{ave}}$. If $\Lambda_\infty^*(\mathbb{P}, \mathbb{Q}, \cdot)$ were known to be strictly convex on $(D_{\min}, D_{\mathrm{ave}})$, then we could apply the second part of Lemma 11 in the same manner as Section V-D1 to get the upper bound on $(D_{\min}, D_{\mathrm{ave}}]$. Unfortunately, we were unable to find a simple proof of this strict convexity. Instead we will apply Lemma 11 to an approximating sequence of random variables $(\tilde{Z}_n)_{n\ge 1}$.

Fix $m \in \mathbb{N}$. Let $\tilde{\mathbb{Q}}$ denote the distribution of a random process $(\tilde{Y}_n)_{n\ge 1}$ taking values in $T$ with the property that each block $\tilde{Y}_{m(k-1)+1}^{mk}$, $k \ge 1$, has distribution $Q_m$ and is independent of all the other $\tilde{Y}_j$'s. We use $\tilde{Q}_n$ to denote the distribution of $\tilde{Y}_1^n$. If $n = m\ell + r$, $1 \le r \le m$, then $\tilde{Q}_n = (\times_{k=1}^{\ell} Q_m) \times Q_r$ and

\[ C^{-\ell}\,\tilde{Q}_n(A) \le Q_n(A) \le C^{\ell}\,\tilde{Q}_n(A). \tag{20} \]
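The sandwich (20) can be checked by brute force on a small example. The sketch below is only an illustration under assumptions that are not taken from the paper: $\mathbb{Q}$ is a hypothetical stationary two-state Markov chain, and the mixing constant $C$ is taken to be the largest ratio between a transition probability and the corresponding stationary probability (in either direction), so that $C^{-1} Q_n \times Q_m \le Q_{n+m} \le C\, Q_n \times Q_m$.

```python
import itertools

T = [[0.7, 0.3], [0.4, 0.6]]          # hypothetical transition matrix
PI = [4.0 / 7.0, 3.0 / 7.0]           # its stationary distribution
# Assumed mixing constant: each block junction changes the probability by a factor in [1/C, C].
C = max(max(T[a][b] / PI[b], PI[b] / T[a][b]) for a in (0, 1) for b in (0, 1))

def q_markov(y):
    """Q_n(y) for the stationary Markov chain."""
    prob = PI[y[0]]
    for a, b in zip(y, y[1:]):
        prob *= T[a][b]
    return prob

def q_block(y, m):
    """Block-independent approximation: consecutive m-blocks (and the remainder) are independent."""
    prob = 1.0
    for start in range(0, len(y), m):
        prob *= q_markov(y[start:start + m])
    return prob

def check_sandwich(n, m):
    """Verify C^{-ell} * Qtilde_n(y) <= Q_n(y) <= C^{ell} * Qtilde_n(y) for every string y."""
    ell = (n - 1) // m                # n = m*ell + r with 1 <= r <= m
    for y in itertools.product((0, 1), repeat=n):
        ratio = q_markov(y) / q_block(y, m)
        if not (C ** (-ell) * (1 - 1e-9) <= ratio <= C ** ell * (1 + 1e-9)):
            return False
    return True

print(C, check_sandwich(7, 3), check_sandwich(6, 3))
```

The pointwise bound on strings implies the bound on arbitrary sets $A$ by summation.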

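The next lemma summarizes how $\tilde{\mathbb{Q}}$ behaves in our context.

Lemma 14: Fix $m \in \mathbb{N}$ and define $\tilde{\mathbb{Q}}$ as above. Then

\begin{align*}
\Lambda_\infty(\delta_{X_1^\infty}, \tilde{\mathbb{Q}}, \lambda) &:= \lim_{n\to\infty} \frac{1}{n}\Lambda_n(\delta_{X_1^n}, \tilde{Q}_n, n\lambda) \\
&= \frac{1}{m}\Lambda_m(P_m(\cdot|\mathcal{I}), Q_m, m\lambda) \tag{21}
\end{align*}

exists and has the above representation for all $\lambda \in \mathbb{R}$ with probability 1, where $P_m(\cdot|\mathcal{I})$ is a random probability distribution on $S^m$ depending only on the sequence $X_1^\infty$. Furthermore,

\[ \Lambda_\infty^*(\delta_{X_1^\infty}, \tilde{\mathbb{Q}}, D) := \sup_{\lambda \le 0}\left[\lambda D - \Lambda_\infty(\delta_{X_1^\infty}, \tilde{\mathbb{Q}}, \lambda)\right] \]

is strictly convex in $D$ on $(D_{\min}, D_{\mathrm{ave}})$ and

\begin{align*}
\Lambda_\infty^*(\delta_{X_1^\infty}, \tilde{\mathbb{Q}}, D) - \frac{\log C}{m} &\le \Lambda_\infty^*(\mathbb{P}, \mathbb{Q}, D) \\
&\le \Lambda_\infty^*(\delta_{X_1^\infty}, \tilde{\mathbb{Q}}, D) + \frac{\log C}{m} \tag{22}
\end{align*}

for all $D$ with probability 1.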

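Proof: To simplify notation, fix $\lambda$ and define the r.v.

\[ \Lambda_n := \Lambda_n(\delta_{X_1^n}, \tilde{Q}_n, n\lambda). \]

We will first show that the convergence of $\Lambda_n/n$ is a.s. determined by the convergence of the subsequence $\Lambda_{m\ell}/(m\ell)$ as $\ell \to \infty$.

The ergodic theorem gives

\[ \frac{1}{n}\sum_{k=1}^{n}\Lambda(\delta_{X_k}, Q, \lambda) \to \Lambda(P, Q, \lambda). \tag{23} \]

Analogous to the arguments in Section IV-B,

\[ \frac{1}{n}\Lambda_n \in \frac{1}{n}\sum_{k=1}^{n}\Lambda(\delta_{X_k}, Q, \lambda) \pm \log C. \tag{24} \]

If $\Lambda(P, Q, \lambda)$ is infinite, then (23) and (24) show that $\lim_n \Lambda_n/n$ exists and is infinite a.s. In particular, $\lim_n \Lambda_n/n = \lim_\ell \Lambda_{m\ell}/(m\ell)$.

If $\Lambda(P, Q, \lambda)$ is finite, then (23) shows that

\[ \frac{1}{n}\Lambda(\delta_{X_n}, Q, \lambda) \to 0 \]

which implies that

\[ \frac{1}{n}\Lambda_r(\delta_{X_{n+1}^{n+r}}, Q_r, r\lambda) \to 0 \tag{25} \]

for each $r$; see (12). Writing $n = m\ell + r$ for $1 \le r \le m$, the block-independence property of $\tilde{\mathbb{Q}}$ gives

\[ \Lambda_n = \Lambda_{m\ell} + \Lambda_r(\delta_{X_{m\ell+1}^{m\ell+r}}, Q_r, r\lambda). \]

Combining this with (25) shows that $\Lambda_n/n$ has a.s. the same asymptotic behavior as $\Lambda_{m\ell}/(m\ell)$.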

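We will now analyze the limiting behavior of $\Lambda_{m\ell}/(m\ell)$. The block-independence property of $\tilde{\mathbb{Q}}$ gives

\[ \frac{1}{m\ell}\Lambda_{m\ell} = \frac{1}{m\ell}\sum_{k=1}^{\ell}\Lambda_m(\delta_{X_{m(k-1)+1}^{mk}}, Q_m, m\lambda). \tag{26} \]

The sequence $(X_{m(k-1)+1}^{mk})_{k\ge 1}$ of disjoint $m$-blocks from $(X_n)_{n\ge 1}$ is stationary (but not necessarily ergodic), so the ergodic theorem [10, Theorem 10.6] gives

\begin{align*}
\lim_{\ell\to\infty}\frac{1}{\ell}\sum_{k=1}^{\ell}\Lambda_m(\delta_{X_{(k-1)m+1}^{km}}, Q_m, m\lambda) = E\left[\Lambda_m(\delta_{X_1^m}, Q_m, m\lambda) \,\middle|\, \mathcal{I}\right] \tag{27}
\end{align*}

where $\mathcal{I}$ is the shift invariant $\sigma$-field for the sequence $(X_{m(k-1)+1}^{mk})_{k\ge 1}$. Letting $P_m(\cdot|\mathcal{I})$ denote the regular conditional distribution of $X_1^m$ given $\mathcal{I}$, the right side of (27) is $\Lambda_m(P_m(\cdot|\mathcal{I}), Q_m, m\lambda)$.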

Combining (26) and (27) and recalling our discussion about the subsequence $(m\ell)_{\ell\ge 1}$ shows that (21) holds a.s. for each specific $\lambda$. Since $\Lambda_n$ is increasing in $\lambda$ and since $\mathcal{I}$ does not depend on $\lambda$, we can choose the exceptional set independently of $\lambda$. This implies that the corresponding $\Lambda_\infty^*(\delta_{X_1^\infty}, \tilde{\mathbb{Q}}, \cdot)$ is a.s. well-defined and the exceptional set does not depend on $D$.

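Two applications of the ergodic theorem show that

\begin{align*}
D_{\mathrm{ave}} &\overset{\text{a.s.}}{=} \lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} E_{Y_1}\rho(X_k, Y_1) \\
&= \lim_{\ell\to\infty}\frac{1}{\ell}\sum_{k=1}^{\ell}\left(\frac{1}{m}\sum_{j=1}^{m} E_{Y_1}\rho(X_{(k-1)m+j}, Y_1)\right) \\
&= \lim_{\ell\to\infty}\frac{1}{\ell}\sum_{k=1}^{\ell} E_{Y_1^m}\rho_m(X_{(k-1)m+1}^{km}, Y_1^m) \\
&\overset{\text{a.s.}}{=} E\left[E_{Y_1^m}\rho_m(X_1^m, Y_1^m) \,\middle|\, \mathcal{I}\right] \\
&= E_{X_1^m\sim P_m(\cdot|\mathcal{I})}\left[E_{Y_1^m}\rho_m(X_1^m, Y_1^m)\right]. \tag{28}
\end{align*}

An identical argument, combined with (7), gives

\[ D_{\min} = E_{X_1^m\sim P_m(\cdot|\mathcal{I})}\left[\operatorname*{ess\,inf}_{Y_1^m}\rho_m(X_1^m, Y_1^m)\right]. \tag{29} \]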



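Because of the representation on the right side of (21), we can apply Lemma 8 with $S = S^m$, $T = T^m$, $\rho = \rho_m$, $X \sim P_m(\cdot|\mathcal{I})$, and $Y \sim Q_m$ to see that $\Lambda_\infty^*(\delta_{X_1^\infty}, \tilde{\mathbb{Q}}, \cdot)$ is strictly convex on $(D_{\min}, D_{\mathrm{ave}})$ a.s. Identifying the $D_{\min}$ and $D_{\mathrm{ave}}$ from Lemma 8 with $D_{\min}$ and $D_{\mathrm{ave}}$ here follows from (29) and (28) above.

Finally, analogous to the arguments in Section IV-B, (20) gives

\begin{align*}
\Lambda_n(\delta_{X_1^n}, \tilde{Q}_n, n\lambda) - \frac{n}{m}\log C &\le \Lambda_n(\delta_{X_1^n}, Q_n, n\lambda) \\
&\le \Lambda_n(\delta_{X_1^n}, \tilde{Q}_n, n\lambda) + \frac{n}{m}\log C.
\end{align*}

Combining this with (21) and (19) gives (22). ■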
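The strict convexity that Lemma 8 supplies can be observed numerically in the simplest case $m = 1$. The sketch below computes $\Lambda^*(D) = \sup_{\lambda\le 0}[\lambda D - \Lambda(\lambda)]$ for a hypothetical memoryless binary example with Hamming distortion (an assumption made only for illustration), for which $D_{\min} = 0$ and $D_{\mathrm{ave}} = (1-p_1)q_1 + p_1(1-q_1)$. It is a numerical check of one midpoint inequality, not a proof.

```python
import math

def cgf(p1, q1, lam):
    """Lambda(P, Q, lam) = E_P log E_Q exp(lam * rho(X, Y)), Hamming rho on {0, 1}."""
    def letter(p_mismatch):
        return math.log(1.0 - p_mismatch + p_mismatch * math.exp(lam))
    return (1.0 - p1) * letter(q1) + p1 * letter(1.0 - q1)

def rate(p1, q1, d, lo=-60.0, iters=200):
    """Lambda^*(d) = sup_{lam <= 0} [lam*d - Lambda(lam)], by ternary search on [lo, 0]."""
    def g(lam):
        return lam * d - cgf(p1, q1, lam)
    a, b = lo, 0.0
    for _ in range(iters):
        m1, m2 = a + (b - a) / 3, b - (b - a) / 3
        if g(m1) < g(m2):
            a = m1
        else:
            b = m2
    return max(g(0.5 * (a + b)), 0.0)      # lam = 0 always gives the value 0

P1, Q1 = 0.3, 0.4                          # hypothetical source and codebook parameters
D_AVE = (1 - P1) * Q1 + P1 * (1 - Q1)      # average distortion, here 0.46
for d in (0.1, 0.25, 0.4, D_AVE):
    print(round(d, 2), rate(P1, Q1, d))
```

In this example the rate is decreasing on $(0, D_{\mathrm{ave}})$, vanishes at $D_{\mathrm{ave}}$, and the value at the midpoint 0.25 lies strictly below the average of the values at 0.1 and 0.4.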
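Returning to the main argument, fix a realization $(x_n)_{n\ge 1}$ of $(X_n)_{n\ge 1}$ so that everything holds in Lemma 14. Define the sequences of random variables $(Z_n)_{n\ge 1}$ and $(\tilde{Z}_n)_{n\ge 1}$ by $Z_n := \rho_n(x_1^n, Y_1^n)$ and $\tilde{Z}_n := \rho_n(x_1^n, \tilde{Y}_1^n)$. (20) shows that

\begin{align*}
L_n(x_1^n, Q_n, D) &= -\frac{1}{n}\log Q_n(B_n(x_1^n, D)) \\
&\le -\frac{1}{n}\log \tilde{Q}_n(B_n(x_1^n, D)) + \frac{\log C}{m} \\
&= -\frac{1}{n}\log \mathrm{Prob}\{\tilde{Z}_n \le D\} + \frac{\log C}{m}.
\end{align*}

Lemma 14 lets us apply the second part of Lemma 11 to the right side to get

\begin{align*}
\limsup_{n\to\infty} L_n(x_1^n, Q_n, D) &\le \Lambda_\infty^*(\delta_{x_1^\infty}, \tilde{\mathbb{Q}}, D) + \frac{\log C}{m} \\
&\le \Lambda_\infty^*(\mathbb{P}, \mathbb{Q}, D) + 2\,\frac{\log C}{m}
\end{align*}

for all $D \in (D_{\min}, D_{\mathrm{ave}}]$. The final inequality comes from (22). Since $m$ was arbitrary and since $(x_n)_{n\ge 1}$ was a.s. arbitrary, we have established the upper bound (18) for the case $D_{\min} < D \le D_{\mathrm{ave}}$.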
4) The case $D = D_{\min}$: We have established the lower bound (17) for all $D$ and the upper bound (18) for all $D$ except for the case when $D = D_{\min} := D_{\min}(\mathbb{P}, \mathbb{Q})$ and $\Lambda_\infty^*(\mathbb{P}, \mathbb{Q}, D_{\min}) < \infty$. We analyze that situation here. To simplify notation, we will suppress the dependence on $\mathbb{P}$ and $\mathbb{Q}$ whenever it is clear from the context.

Define 



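Because of (7),

\[ Q_{n+m}(A_{n+m}(x_1^{n+m})) = Q_{n+m}\left(A_n(x_1^n) \times A_m(x_{n+1}^{n+m})\right) \]

and the mixing properties of $\mathbb{Q}$ give

\begin{align*}
-\log Q_{n+m}(A_{n+m}(x_1^{n+m})) + \log C &\le \left[-\log Q_n(A_n(x_1^n)) + \log C\right] \\
&\quad + \left[-\log Q_m(A_m(x_{n+1}^{n+m})) + \log C\right].
\end{align*}

Lemma 8 shows that

\[ E\left[-\log Q_n(A_n(X_1^n))\right] = n\Lambda_n^*(P_n, Q_n, D_{\min}) \]

which we assume is finite, so we can apply the subadditive ergodic theorem and Proposition 5 to get

\[ \lim_{n\to\infty} -\frac{1}{n}\log Q_n(A_n(X_1^n)) = \Lambda_\infty^*(\mathbb{P}, \mathbb{Q}, D_{\min}). \tag{30} \]

Note that if $\rho_Q(X_1)$ is a.s. constant, then $Q_n(A_n(X_1^n)) = Q_n(B_n(X_1^n, D_{\min}))$ and (30) gives the upper bound.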

Now suppose $\rho_Q(X_1)$ is not a.s. constant (and $D = D_{\min}$ and $\Lambda_\infty^*(D_{\min}) < \infty$). This is the only pathological situation where the upper bound does not hold. Our analysis makes use of recurrence properties for random walks with stationary and ergodic increments. What we need is summarized in the following lemma:

Lemma 15: Let $(U_n)_{n\ge 1}$ be a real-valued stationary and ergodic process and define $W_n := \sum_{k=1}^{n} U_k$, $n \ge 1$. If $EU_1 = 0$ and $\mathrm{Prob}\{U_1 \ne 0\} > 0$, then $\mathrm{Prob}\{W_n > 0 \text{ i.o.}\} > 0$ and $\mathrm{Prob}\{W_n \ge 0 \text{ i.o.}\} = 1$.

Proof: Define $W_0 := 0$. $(W_n)_{n\ge 0}$ is a random walk with stationary and ergodic increments. [11] shows that $\{\liminf_n n^{-1}W_n > 0\}$ and $\{W_n \to \infty\}$ differ by a null set. The ergodic theorem gives $\mathrm{Prob}\{n^{-1}W_n \to 0\} = 1$, so $\mathrm{Prob}\{W_n \to \infty\} = 0$. Similarly, by considering the process $-W_n$, we see that $\mathrm{Prob}\{W_n \to -\infty\} = 0$.

Now $\{|W_n| \to \infty\}$ is invariant and must have probability 0 or 1. If it has probability 1, then since we cannot have $W_n \to \infty$ or $W_n \to -\infty$ we must have $W_n$ oscillating between increasingly larger positive and negative values, which means $\mathrm{Prob}\{W_n > 0 \text{ i.o.}\} = 1$ and completes the proof.

Suppose $\mathrm{Prob}\{|W_n| \to \infty\} = 0$. Define

\[ N(A) := \sum_{n\ge 0} \mathbb{1}\{W_n \in A\}, \qquad A \subset \mathbb{R}, \]

to be the number of times the random walk visits the set $A$. [1, Corollary 2.3.4] shows that either $N(J) < \infty$ a.s. for all bounded intervals $J$ or $\{N(J) = 0\} \cup \{N(J) = \infty\}$ has probability 1 for all intervals $J$ (open or closed, bounded or unbounded, but not a single point). By assumption $|W_n| \not\to \infty$, so we can rule out the first possibility. Since $\mathrm{Prob}\{W_0 = 0\} = 1$, we see that for any interval $J$ containing 0 we must have $\mathrm{Prob}\{N(J) = \infty\} = 1$. In particular, taking $J := [0, \infty)$ shows that $\mathrm{Prob}\{W_n \ge 0 \text{ i.o.}\} = 1$. Similarly, taking $J := (0, \infty)$ shows that $\mathrm{Prob}\{W_n > 0 \text{ i.o.}\} = \mathrm{Prob}\{N(J) = \infty\} = \mathrm{Prob}\{N(J) > 0\} \ge \mathrm{Prob}\{U_1 > 0\} > 0$.

(Footnote: $(W_n)_{n\ge 0}$ is a random walk with stationary and ergodic increments [1] if $W_0 := 0$ and $W_n := \sum_{k=1}^{n} U_k$, $n \ge 1$, for some stationary and ergodic sequence $(U_n)_{n\ge 1}$.)
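A small deterministic example, not taken from the paper, shows why Lemma 15 can only promise positive probability for $\{W_n > 0 \text{ i.o.}\}$ while $\{W_n \ge 0 \text{ i.o.}\}$ has probability 1. Take the period-two increment sequence $+1, -1, +1, \ldots$ started at a uniformly random phase; this process is stationary and ergodic with mean zero. The sketch below counts the relevant visits for both phases. The helper names are hypothetical.

```python
def sign_visits(increments):
    """Return (#{n >= 1 : W_n > 0}, #{n >= 1 : W_n >= 0}) for W_n = U_1 + ... + U_n."""
    w = 0
    positive = nonnegative = 0
    for u in increments:
        w += u
        positive += int(w > 0)
        nonnegative += int(w >= 0)
    return positive, nonnegative

def alternating(first, n):
    """First n increments of the period-two sequence first, -first, first, ..."""
    return [first if k % 2 == 0 else -first for k in range(n)]

for first in (+1, -1):
    print(first, sign_visits(alternating(first, 1000)))
```

With phase $+1$ the walk is $1, 0, 1, 0, \ldots$ and is strictly positive on every odd step; with phase $-1$ it is $-1, 0, -1, 0, \ldots$ and never strictly positive, yet it returns to 0 on every even step. So in this example $\mathrm{Prob}\{W_n > 0 \text{ i.o.}\} = 1/2$ while $\mathrm{Prob}\{W_n \ge 0 \text{ i.o.}\} = 1$.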
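Returning to the main argument,

\begin{align*}
L_n(X_1^n, Q_n, D_{\min}) &\ge -\frac{1}{n}\log \mathbb{1}\left\{\frac{1}{n}\sum_{k=1}^{n}\rho_Q(X_k) \le D_{\min}\right\} \\
&= \begin{cases} 0 & \text{if } W_n \le 0 \\ \infty & \text{if } W_n > 0 \end{cases} \tag{31}
\end{align*}

where $W_n := \sum_{k=1}^{n}(\rho_Q(X_k) - D_{\min})$. Lemma 15 shows that $\mathrm{Prob}\{W_n > 0 \text{ i.o.}\} > 0$. This and (31) prove (4a).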

Lemma 15 also shows that $\mathrm{Prob}\{W_n \le 0 \text{ i.o.}\} = 1$. Let $(N_m)_{m\ge 1}$ be the (a.s.) infinite, random subsequence of $(n)_{n\ge 1}$ such that $W_n \le 0$. Note that

\[ \frac{1}{N_m}\sum_{k=1}^{N_m}\rho_Q(X_k) \le D_{\min} \]

so

\[ L_{N_m}(X_1^{N_m}, Q_{N_m}, D_{\min}) \le -\frac{1}{N_m}\log Q_{N_m}(A_{N_m}(X_1^{N_m})). \tag{32} \]

Now, the final expression in (32) is a.s. finite because $E[-\log Q_n(A_n(X_1^n))] = n\Lambda_n^*(D_{\min}) < \infty$. This proves (4b) and shows that $(N_m)_{m\ge 1}$ satisfies the claims of the theorem, including ©. Letting $m \to \infty$ in (32) and using (30) gives (4c), the upper bound along the sequence $(N_m)_{m\ge 1}$. Together with the lower bound (17), it also shows that $\liminf_n L_n$ is a.s. equal to $\Lambda_\infty^*(D_{\min})$ even in this pathological case.
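The pathology at $D = D_{\min}$ can be reproduced in a toy model. The following sketch uses a hypothetical example that is not from the paper: binary alphabets, distortion $\rho(x, y) = x + y$, a memoryless source with $P(X = 1) = p_1$ and a memoryless codebook with $Q(Y = 0) = q_0 > 0$. Then $\rho_Q(x) = x$ is not constant, $D_{\min} = p_1$, and (under our reading of the set $A_n$ as the codewords that attain $\rho_Q(x_k)$ in every coordinate) $Q_n(A_n(x_1^n)) = q_0^n$.

```python
import math

def l_n_at_dmin(x, p1, q0):
    """L_n(x_1^n, Q_n, D_min) for the assumed distortion rho(x, y) = x + y."""
    n = len(x)
    w_n = sum(x) - n * p1                    # W_n = sum_k (rho_Q(x_k) - D_min)
    if w_n > 0:
        return math.inf                      # no codeword is D_min-close to x
    budget = math.floor(-w_n + 1e-12)        # number of codeword symbols allowed to be 1
    prob = sum(math.comb(n, j) * (1 - q0) ** j * q0 ** (n - j) for j in range(budget + 1))
    return -math.log(prob) / n

P1, Q0 = 0.5, 0.5                            # hypothetical parameters
for x in ([1, 1], [1, 0], [0, 0], [1, 0, 1, 0, 0, 0]):
    print(x, l_n_at_dmin(x, P1, Q0), -math.log(Q0))
```

Whenever $W_n > 0$ the value is infinite, as in (31), and whenever $W_n \le 0$ it is bounded by $-\log q_0$, which is the right side of (32) in this model.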

5) Replacing $L_n$ with $R_n$: Defining $Z_n := \rho_n(x_1^n, Y_1^n)$, Proposition 1 and Lemma 12 show that

\[ R_n(\delta_{x_1^n}, Q_n, D) = \Lambda_n^*(\delta_{x_1^n}, Q_n, D) \le L_n(x_1^n, Q_n, D) \]

and that $R_n$ and $L_n$ are finite (infinite) together. Since we have already established that $L_n(X_1^n, Q_n, D)$ and $R_n(\delta_{X_1^n}, Q_n, D)$ have the same lower bound (17), we can use the above bound to squeeze $R_n$ whenever $\lim_n L_n$ exists.

In the only pathological situation where the limit does not exist, $L_n$ converges along the subsequence where it is finite, so $R_n$ converges along that subsequence also. But as we noted above, $L_n$ and $R_n$ have the same subsequence where they are finite.